Collocational analysis in Japanese text input
Abstract
This paper proposes a new disambiguation method for Japanese text input. The method evaluates candidate sentences by measuring the number of Word Co-occurrence Patterns (WCP) included in them. An automatic WCP extraction method is also developed. An extraction experiment using example sentences from dictionaries confirms that WCP can be collected automatically with an accuracy of 98.7%, using syntactic analysis and some heuristic rules to eliminate erroneous extraction. Using this method, about 305,000 sets of WCP are collected. A co-occurrence pattern matrix with semantic categories is built from these WCP. Using this matrix, the mean number of candidate sentences in Kana-to-Kanji translation is reduced to about 1/10 of that from existing morphological methods.

1. Introduction

For keyboard input of Japanese, the Kana-to-Kanji translation method [Kawada79] [Makino80] [Abe86] is the most popular technique. In this method, Kana input sentences are translated automatically into Kanji-Kana sentences. However, non-segmented Kana input is highly ambiguous, because of both the segmentation ambiguities of Kana input into morphemes and homonym ambiguities. Some research has been carried out, mainly to overcome homonym ambiguity, using a word usage dictionary [Makino80] and case grammar [Abe86].

A new technique, named the collocational analysis method, is proposed to overcome both ambiguities. It evaluates the certainty of candidate sentences by measuring the number of co-occurrence patterns between word pairs, and is used in addition to the usual morphological analysis. To realize this, it is essential to build a dictionary that reflects Word Co-occurrence Patterns (WCP). In English processing research, there has been an attempt [Grishman86] to collect sublanguage selectional patterns semi-automatically.
In Japanese processing research, there have been attempts [Shirai86] [Tanaka86] to collect combinations of words with this kind of relationship, either completely or semi-automatically. These two attempts did not provide a dictionary for practical use.

A new method is proposed for building a dictionary which accumulates WCP. The first feature of this method is that WCP are collected from common combinations of two words having a dependency relationship in a sentence, because these common combinations will most likely reoccur in future texts. In this method, it is important to identify dependency relationships between word pairs, rather than the whole dependency structure of the sentence. For this purpose, Dependency Localization Analysis (DLA) is used. DLA identifies the word pairs having a definite dependency relationship using syntactic analysis and some heuristic rules.

This paper first describes collocational analysis, a new concept in Kana-to-Kanji translation, then the compilation of the WCP dictionary, next the translation algorithm, and finally the experimental translation results.

2. Concept of Collocational Analysis in Translation

Collocational analysis evaluates the correctness of a translated sentence by measuring the WCP within the sentence. The WCP data is accumulated in a 2-dimensional matrix, indexed by information units indicating more restricted concepts than the words can indicate by themselves.

As previously mentioned, there are two kinds of ambiguities in Kana-to-Kanji translation. Fig. 1 illustrates the disambiguation process for homonyms. '国歌 (a national anthem) and 演奏する (to play)' and '国家 (a state) and 建設する (to build)' are examples of WCP.
If the simple Kana sequence 'こっかをえんそうする [kokkaoensousuru]' is input, the usual translation system will develop two possible candidate words, '国歌 (a national anthem)' and '国家 (a state)', for the partial Kana sequence 'こっか [kokka]'. The system will also uniquely develop the candidate word '演奏する (to play)' for 'えんそうする [ensousuru]'. These candidate words are obtained by table searching and morphological analysis. However, morphological analysis alone cannot identify which candidate is correct for 'こっか [kokka]'. Using collocational analysis, the WCP of '国家 (a state)' and '演奏する (to play)' is found to be nil, while that of '国歌 (a national anthem)' and '演奏する (to play)' is found to be probable. Using WCP, '国歌を演奏する (to play a national anthem)' is selected as the final candidate sentence. If the Kana sequence 'こっかをけんせつする [kokkaokensetsusuru]' is input, '国家を建設する (to build a state)' is obtained in the same manner.

[Fig. 1: Disambiguation of homonyms for [kokka]. Surviving figure labels: (Japanese), [nihon], (a national anthem), (to play) [ensousuru].]
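The homonym example above can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions, not the paper's implementation: the `WCP` set, the function names `wcp_count` and `best_candidate`, and the pre-segmented input are all hypothetical. It looks up word pairs directly rather than through the semantic-category matrix the paper describes, and it counts every word pair in a candidate instead of only pairs with an identified dependency relationship.

```python
from itertools import combinations, product

# Hypothetical toy WCP dictionary: unordered word pairs assumed to co-occur.
WCP = {
    frozenset(("国歌", "演奏する")),  # a national anthem + to play
    frozenset(("国家", "建設する")),  # a state + to build
}


def wcp_count(candidate):
    """Count the word pairs of a candidate sentence found in the WCP set."""
    return sum(frozenset(pair) in WCP for pair in combinations(candidate, 2))


def best_candidate(segments):
    """Pick the candidate sentence containing the most WCP.

    segments holds one list of candidate words per Kana segment, as
    morphological analysis and table lookup would produce them.
    """
    candidates = [list(words) for words in product(*segments)]
    return max(candidates, key=wcp_count)


# [kokka] has two homonym candidates; [ensousuru] has one.
segments = [["国家", "国歌"], ["を"], ["演奏する"]]
print(best_candidate(segments))
```

For the input above, the candidate containing 国家 (a state) matches no WCP while the one containing 国歌 (a national anthem) matches one, so the sketch prints `['国歌', 'を', '演奏する']`. Replacing the last segment with 建設する (to build) selects 国家 instead.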
Similar resources
Collocational Aid for Learners of Japanese as a Second Language
We present Collocation Assistant, a prototype of a collocational aid designed to promote the collocational competence of learners of Japanese as a second language (JSL). Focusing on noun-verb constructions, the tool automatically flags possible collocation errors and suggests better collocations by using corrections extracted from a large annotated Japanese language learner corpus. Each suggest...
A Correlational Study of Expectancy Grammar’s Manifestation on Cloze Test and Lexical Collocational Density
The notion of expectancy grammar as a key to understanding the nature of psychologically real processes that underlie language use is introduced by Oller (1979). A central issue in this notion is that expectancy generating systems are constructed and modified in the course of language acquisition. Thus, one of the characteristics of language proficiency is that it consists of such an expectancy...
Japanese Learners’ dictionary of I-adjective-noun Collocations
This paper demonstrates a method for creating Japanese learners dictionary of i-adjective-noun collocations. After an introduction of the importance of collocations and the necessity of their inclusion in Japanese language learning, we present various corpora types and corpus query tools that are used to obtain variety of collocational usage in different types of discourse. The Japanese languag...
Applying a Hybrid Query Translation Method to Japanese/English Cross-Language Patent Retrieval
This paper applies an existing query translation method to cross-language patent retrieval. In our method, multiple dictionaries are used to derive all possible translations for an input query, and collocational statistics are used to resolve translation ambiguity. We used Japanese/English parallel patent abstracts to perform comparative experiments, where our method outperformed a simple dicti...
A Web Corpus and Word Sketches for Japanese
Of all the major world languages, Japanese is lagging behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processi...
Publication date: 1988